46 research outputs found

    Compute units in OpenMP: extensions for heterogeneous parallel programming

    This article evaluates the current support in OpenMP 5.2 for heterogeneous applications that simultaneously activate host and device computing units (e.g., CPUs, GPUs, or FPGAs). The article identifies limitations in the current OpenMP specification and describes the design and implementation of novel OpenMP extensions and runtime support for heterogeneous parallel programming. The Compute Unit (CU) abstraction is introduced into the OpenMP programming model; a CU is defined as an aggregation of computing elements (e.g., CPUs, GPUs, FPGAs). On top of CUs, the article describes dynamic work-sharing constructs and schedulers that address the inherent differences in compute power between host and device CUs. New constructs and the corresponding runtime support are described for the new abstractions. The article evaluates the case of a hybrid multilevel parallelization of the NPB-MZ benchmark suite. The implementation exploits both coarse-grain and fine-grain parallelism, mapped to CUs of different nature (GPUs and CPUs). All CUs are activated using the new extensions and runtime support. We compare hybrid and non-hybrid executions under two state-of-the-art work-distribution schemes (Static and Dynamic Task schedulers). On a computing node composed of one AMD EPYC 7742 @ 2.25 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and two AMD Radeon Instinct MI50 GPUs with 32 GB, hybrid executions present speedups from 1.08× up to 3.18× with respect to a non-hybrid GPU implementation, depending on the number of activated CUs. This work was supported by the Spanish Ministry of Science and Technology (PID2019-107255GB). Peer reviewed. Postprint (published version).
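    The CU directives proposed in the article are defined in the paper itself and are not reproduced here. As a point of reference, the following minimal sketch shows the standard OpenMP pattern the extensions build on, in which host threads and a device are activated simultaneously but the work split (the split parameter below) must be chosen by hand.

```cpp
// Baseline OpenMP 4.5/5.x host+device split (not the article's CU
// extensions): the device chunk runs as a deferred target task while a
// taskloop keeps the CPU threads busy. "split" is an illustrative,
// hand-chosen partition point.
#include <omp.h>
#include <cstddef>

void saxpy_hybrid(float a, float* x, float* y,
                  std::size_t n, std::size_t split) {
  #pragma omp parallel
  #pragma omp single
  {
    // Device chunk: offloaded asynchronously (nowait -> deferred target task).
    #pragma omp target teams distribute parallel for nowait \
            map(to: x[0:split]) map(tofrom: y[0:split])
    for (std::size_t i = 0; i < split; ++i)
      y[i] += a * x[i];

    // Host chunk: executed by the remaining threads of the team.
    #pragma omp taskloop
    for (std::size_t i = split; i < n; ++i)
      y[i] += a * x[i];
  } // the implicit barrier waits for both the target task and the taskloop
}
```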

    Heterogeneous programming using OpenMP and CUDA/HIP for hybrid CPU-GPU scientific applications

    Hybrid computer systems combine compute units (CUs) of different nature, like CPUs, GPUs and FPGAs. Simultaneously exploiting the computing power of these CUs requires a careful decomposition of the applications into balanced parallel tasks according to both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for OpenMP hybrid GPU-CPU applications when mixed with GPU-oriented programming models (e.g. CUDA/HIP). The paper describes the case of a hybrid multi-level parallelization of the NPB-MZ benchmark suite. The implementation exploits both coarse-grain and fine-grain parallelism, mapped to compute units of different nature (GPUs and CPUs). The paper describes the implementation of runtime support to bridge OpenMP and HIP, introducing the abstractions of Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under state-of-the-art schedulers for OpenMP: static and dynamic task scheduling. Then, we improve the set of schedulers with two additional variants: a memorizing dynamic task scheduling and a profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.25 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and two AMD Radeon Instinct MI50 GPUs with 32 GB, hybrid executions present speedups from 1.10× up to 3.5× with respect to a non-hybrid GPU implementation, depending on the number of activated CUs. This work was supported by the Spanish Ministry of Science and Technology (PID2019-107255GB). Peer reviewed. Postprint (author's final draft).
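    The paper's Computing Unit and Data Placement runtime abstractions are not a public API, so the sketch below only illustrates the general shape of an OpenMP+HIP hybrid: one OpenMP task drives a HIP kernel for its share of the data while a taskloop processes the rest on the CPU. The kernel scale_gpu and the 50/50 partition are illustrative assumptions.

```cpp
// Illustrative OpenMP + HIP mix (not the paper's Computing Unit / Data
// Placement API): one task owns the GPU work, a taskloop covers the rest.
// scale_gpu and the 50/50 split are assumptions for the example.
#include <hip/hip_runtime.h>
#include <omp.h>
#include <vector>

__global__ void scale_gpu(float* y, float a, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) y[i] *= a;
}

void scale_hybrid(std::vector<float>& y, float a) {
  const size_t n = y.size(), gpu_n = n / 2;   // hand-chosen partition
  float* host = y.data();

  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task                          // GPU compute unit
    {
      float* d = nullptr;
      hipMalloc(&d, gpu_n * sizeof(float));
      hipMemcpy(d, host, gpu_n * sizeof(float), hipMemcpyHostToDevice);
      hipLaunchKernelGGL(scale_gpu, dim3((gpu_n + 255) / 256), dim3(256),
                         0, 0, d, a, gpu_n);
      hipMemcpy(host, d, gpu_n * sizeof(float), hipMemcpyDeviceToHost);
      hipFree(d);
    }
    #pragma omp taskloop                      // CPU compute units
    for (size_t i = gpu_n; i < n; ++i)
      host[i] *= a;
  }
}
```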

    Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks

    Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurate characterization of such a system, through predictive pre-silicon modeling and/or diagnostic post-silicon measurement-based analysis, is increasingly cumbersome and error prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks, generated with particular objectives in mind, hold the key to obtaining accurate energy-related characterization. As such, we first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results centered around an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How to project application-specific (and if needed, phase-specific) power consumption with component-wise breakdowns? (b) How to measure energy-per-instruction (EPI) values for the target machine? (c) How to bound the worst-case (maximum) power consumption in order to determine safe but practical (i.e. affordable) packaging or cooling solutions? The solution approaches to the above problems are all new. Hardware-measurement-based analysis shows superior power projection accuracy (with error margins of less than 2.3% across SPEC CPU2006) as well as max-power stressing capability (with a 10.7% increase in processor power over the very worst-case power seen during the execution of SPEC CPU2006 applications). Peer reviewed. Postprint (author's final draft).
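    As a small worked example of query (b), energy-per-instruction follows directly from an average power reading, the elapsed time of the micro-benchmark loop, and the retired-instruction count; the helper below only shows that arithmetic, with placeholder inputs, and is not part of MicroProbe.

```cpp
// EPI from externally measured quantities (not part of MicroProbe).
// The inputs are placeholders for sensor and hardware-counter readings.
#include <cstdio>

double energy_per_instruction(double avg_power_w, double elapsed_s,
                              double retired_instructions) {
  double energy_j = avg_power_w * elapsed_s;   // E = P_avg * t
  return energy_j / retired_instructions;      // joules per instruction
}

int main() {
  // Hypothetical readings taken while a micro-benchmark loop runs.
  double epi = energy_per_instruction(/*watts=*/80.0, /*seconds=*/1.0,
                                      /*instructions=*/4.0e9);
  std::printf("EPI = %.3e J/instruction\n", epi);
}
```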

    Exploiting pipelined executions in OpenMP

    We propose a set of extensions to the OpenMP programming model to express point-to-point synchronisation schemes. This is accomplished by defining, in the form of directives, precedence relations among the tasks originated from OpenMP work-sharing constructs. The proposal is based on the definition of a name space that identifies the work parceled out by these work-sharing constructs. The programmer then defines the precedence relations using this name space. This relieves the programmer from the burden of defining complex synchronization data structures and inserting explicit synchronization actions in the program, which make the program difficult to understand and maintain. We briefly describe the main aspects of the runtime implementation required to support precedence relations in OpenMP. We focus on the evaluation of the proposal through its use in two benchmarks: NAS LU and ASCI Sweep3D. This research has been supported by the Ministry of Science and Technology of Spain and the European Union (FEDER funds) under contract TIC2001-0995-C02-01. Peer reviewed. Postprint (author's final draft).
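    The proposed precedence directives are specific to this proposal, so they are not reproduced here; standard OpenMP doacross loops (ordered with depend(sink)/depend(source), available since OpenMP 4.5) express comparable point-to-point dependences and illustrate the wavefront pattern found in codes such as LU and Sweep3D. The 2-D loop nest and array a below are illustrative.

```cpp
// Standard OpenMP doacross dependences (OpenMP >= 4.5), shown as today's
// way to express the point-to-point synchronisation this proposal targets
// with higher-level directives. The 2-D wavefront over "a" is illustrative.
void wavefront(double* a, int n, int m) {
  #pragma omp parallel for ordered(2)
  for (int i = 1; i < n; ++i) {
    for (int j = 1; j < m; ++j) {
      // Wait for the north and west neighbours produced by earlier iterations.
      #pragma omp ordered depend(sink: i-1, j) depend(sink: i, j-1)
      a[i * m + j] += a[(i - 1) * m + j] + a[i * m + (j - 1)];
      // Signal that (i, j) is now available to its successors.
      #pragma omp ordered depend(source)
    }
  }
}
```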

    Complex pipelined executions in OpenMP parallel applications

    This paper proposes a set of extensions to the OpenMP programming model to express complex pipelined computations. This is accomplished by defining, in the form of directives, precedence relations among the tasks originated from work-sharing constructs. The proposal is based on the definition of a name space that identifies the work parceled out by these work-sharing constructs. The programmer then defines the precedence relations using this name space. This relieves the programmer from the burden of defining complex synchronization data structures and inserting explicit synchronization actions in the program, which make the program difficult to understand and maintain. This work is done transparently by the compiler with the support of the OpenMP runtime library. The proposal is motivated and evaluated with a synthetic multi-block example. The paper also includes a description of the compiler and runtime support in the framework of the NanosCompiler for OpenMP. This research has been supported by the Ministry of Education of Spain under contract TIC98-511 and the CEPBA (European Center for Parallelism of Barcelona). Peer reviewed. Postprint (author's final draft).
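    Again without reproducing the proposed directives, a pipeline over blocks can also be expressed today with standard OpenMP task dependences; the sketch below shows that alternative, with hypothetical stage1/stage2 routines standing in for the per-block computation.

```cpp
// Per-block two-stage pipeline written with standard OpenMP task
// dependences, as an alternative illustration of precedence relations.
// stage1/stage2 are hypothetical per-block routines.
#include <cstddef>

void stage1(double* blk, int blocksize);   // assumed: produces the block
void stage2(double* blk, int blocksize);   // assumed: consumes the block

void pipeline(double* data, int nblocks, int blocksize) {
  #pragma omp parallel
  #pragma omp single
  for (int b = 0; b < nblocks; ++b) {
    double* blk = data + (std::size_t)b * blocksize;
    #pragma omp task depend(out: blk[0])   // stage 1 of block b
    stage1(blk, blocksize);
    #pragma omp task depend(in: blk[0])    // stage 2 waits for stage 1
    stage2(blk, blocksize);
  }
}
```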

    Experiences parallelizing a web server with OpenMP

    Multi-threaded web servers are typically parallelized by hand using the pthreads library. OpenMP has rarely been used to parallelize this kind of application, although we foresee that it can be a great tool for network server developers. In this paper we compare how easy it is to parallelize the Boa web server using OpenMP versus a pthreads parallelization, and the performance achieved by each. We present the results of a parallelization based on OpenMP 2.0, the dynamic sections model and pthreads. Peer reviewed. Postprint (published version).
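    The dynamic sections model evaluated in the paper is an extension rather than standard OpenMP; as a rough stand-in, the sketch below expresses the same request-per-unit-of-work structure with OpenMP tasks (added later, in OpenMP 3.0). accept_connection and handle_request are hypothetical helpers representing the server's accept and serve code.

```cpp
// Request handling with OpenMP tasks (OpenMP >= 3.0), a rough stand-in for
// the dynamic sections model evaluated in the paper. accept_connection()
// and handle_request() are hypothetical helpers for the server's code.
#include <omp.h>

int  accept_connection();       // assumed: blocks until a client connects,
                                // returns a socket descriptor or -1 on shutdown
void handle_request(int fd);    // assumed: serves one HTTP request, closes fd

void serve_forever() {
  #pragma omp parallel
  #pragma omp single            // one thread accepts, the whole team serves
  {
    for (;;) {
      int fd = accept_connection();
      if (fd < 0) break;        // shutdown requested
      #pragma omp task firstprivate(fd)
      handle_request(fd);
    }
  }
}
```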

    Runtime address space computation for SDSM systems

    This paper explores the benefits and limitations of using an inspector/executor approach for Software Distributed Shared Memory (SDSM) systems. The role of the inspector is to obtain a description of the address space accessed during the execution of parallel loops. The information collected by the inspector enables the runtime to optimize the movement of shared data that will happen during the executor phase. This paper addresses the main issues that have been considered to embed an inspector/executor model in an SDSM system: the amount of data collected by the inspector, the accuracy of this data when the loop has data and/or control dependences, and the computational overhead introduced. The paper also includes a description of the SDSM system where the inspector/executor model has been embedded. The proposal is evaluated with four applications from the NAS benchmark suite. The evaluation shows that the accuracy of the inspection and the small overheads introduced by the approach allow its use in an SDSM system. Peer reviewed. Postprint (published version).
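    The SDSM runtime interface itself is not shown in the abstract, so the following is only a minimal sketch of the inspector/executor idea: a cheap first pass records the shared address ranges the loop will touch, a runtime call (the assumed prefetch_shared hook) brings them in, and the executor then runs the real iterations on locally available data.

```cpp
// Minimal inspector/executor sketch (the actual SDSM runtime interface is
// not shown in the abstract). prefetch_shared is an assumed stand-in for
// the call that moves/maps the listed shared regions before execution.
#include <cstddef>
#include <vector>

struct Range { const void* addr; std::size_t bytes; };

void prefetch_shared(const std::vector<Range>& ranges);  // assumed SDSM hook

void gather(double* y, const double* x, const int* idx, std::size_t n) {
  // Inspector: records the address ranges the loop will read, does no work.
  std::vector<Range> accessed;
  accessed.reserve(n);
  for (std::size_t i = 0; i < n; ++i)
    accessed.push_back({&x[idx[i]], sizeof(double)});
  prefetch_shared(accessed);    // runtime brings the shared data in once

  // Executor: the real computation, now free of remote faults.
  #pragma omp parallel for
  for (std::size_t i = 0; i < n; ++i)
    y[i] = x[idx[i]];
}
```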

    Coarse grain parallelization of deep neural networks

    Deep neural networks (DNN) have recently achieved extraordinary results in domains like computer vision and speech recognition. An essential element of this success has been the introduction of high performance computing (HPC) techniques in the critical step of training the neural network. This paper describes the implementation and analysis of a network-agnostic and convergence-invariant coarse-grain parallelization of the DNN training algorithm. The coarse-grain parallelization is achieved through the exploitation of batch-level parallelism. This strategy is independent of the support of specialized and optimized libraries, so the optimization is immediately available for accelerating DNN training. The proposal is compatible with multi-GPU execution without altering the algorithm's convergence rate. The parallelization has been implemented in Caffe, a state-of-the-art DNN framework. The paper describes the code transformations required for the parallelization and identifies the limiting performance factors of the approach. We show competitive performance results for two state-of-the-art computer vision datasets, MNIST and CIFAR-10. In particular, on a 16-core Xeon E5-2667v2 at 3.30 GHz we observe speedups of 8x over the sequential execution, at performance levels similar to those obtained by the GPU-optimized Caffe version on an NVIDIA K40 GPU. Peer reviewed. Postprint (published version).
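    The actual Caffe modifications are not reproduced here; the sketch below only illustrates the batch-level strategy in generic OpenMP terms: each thread processes a slice of the mini-batch into a private gradient buffer, the buffers are reduced, and a single weight update is applied, so the numerical result matches the sequential mini-batch algorithm. backward_one and the plain SGD update are illustrative assumptions.

```cpp
// Batch-level parallelism in generic OpenMP terms (not the actual Caffe
// changes): per-thread gradient buffers over slices of the mini-batch,
// one reduction, one weight update. backward_one and the plain SGD update
// are illustrative assumptions.
#include <cstddef>
#include <vector>

// Assumed per-sample routine: accumulates d(loss)/d(w) for one sample.
void backward_one(const float* sample, const std::vector<float>& w,
                  std::vector<float>& grad);

void train_step(const float* batch, int batch_size, int sample_len,
                std::vector<float>& w, float lr) {
  std::vector<float> grad(w.size(), 0.0f);

  #pragma omp parallel
  {
    std::vector<float> local(w.size(), 0.0f);   // private gradient buffer
    #pragma omp for nowait
    for (int s = 0; s < batch_size; ++s)
      backward_one(batch + (std::size_t)s * sample_len, w, local);

    #pragma omp critical                        // reduce into the shared buffer
    for (std::size_t k = 0; k < grad.size(); ++k)
      grad[k] += local[k];
  }

  // Single update: identical to the sequential mini-batch step.
  for (std::size_t k = 0; k < w.size(); ++k)
    w[k] -= lr * grad[k] / batch_size;
}
```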

    Multi-GPU systems and Unified Virtual Memory for scientific applications: The case of the NAS multi-zone parallel benchmarks

    GPU-based computing systems have become a widely accepted solution for the high-performance computing (HPC) domain. GPUs have shown highly competitive performance-per-watt ratios and can exploit an astonishing level of parallelism. However, exploiting the peak performance of such devices is a challenge, mainly due to the combination of two essential aspects of multi-GPU execution: memory allocation and work distribution. Memory allocation determines the data mapping to GPUs, and therefore conditions all work distribution schemes and communication phases in the application. Unified Virtual Memory simplifies the coding of memory allocations, but its effects on performance depend on how data is used by the devices and how the devices' driver orchestrates the data transfers across the system. In this paper we present a multi-GPU and Unified Virtual Memory (UM) implementation of the NAS Multi-Zone Parallel Benchmarks, which alternate communication and computation phases, offering opportunities to overlap these phases. We analyse the programmability and performance effects of introducing the UM support. Our experience shows that the programming effort for introducing UM is similar to that of having a memory allocation per GPU. On an evaluation environment composed of 2× IBM POWER9 8335-GTH and 4× NVIDIA V100 (Volta) GPUs, our UM-based parallelization outperforms the manual memory allocation versions by 1.10× to 1.85×. However, these improvements are highly sensitive to the information forwarded to the devices' driver describing the most convenient location for specific memory regions. We analyse these improvements in terms of the relationship between the computational and communication phases of the applications. This work was supported by the Spanish Ministry of Science and Technology (PID2019-107255GB). Peer reviewed. Postprint (published version).
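    The benchmark sources are not reproduced here, but the kind of driver hints the abstract refers to can be sketched with the CUDA Unified Memory host API: a managed allocation plus cudaMemAdvise/cudaMemPrefetchAsync calls that tell the driver where a zone should preferably live. The zone size and device id below are illustrative.

```cpp
// Host-side sketch of a Unified Memory allocation plus the location hints
// ("most convenient location") the evaluation depends on. Zone size and
// device id are illustrative; error checking omitted.
#include <cuda_runtime.h>
#include <cstddef>

double* alloc_zone(std::size_t n, int device) {
  double* zone = nullptr;
  cudaMallocManaged(&zone, n * sizeof(double));       // UM allocation

  // Hints: keep the zone resident on "device", but map it for CPU access
  // so host reads do not force the pages to migrate back.
  cudaMemAdvise(zone, n * sizeof(double),
                cudaMemAdviseSetPreferredLocation, device);
  cudaMemAdvise(zone, n * sizeof(double),
                cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

  // Move the pages to the GPU before the first compute phase touches them.
  cudaMemPrefetchAsync(zone, n * sizeof(double), device, /*stream=*/0);
  return zone;
}
```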